Part 1: Loading the Dataset

Begin the analysis by loading the dataset. Every later step depends on the data being read in correctly.

# Load necessary libraries
suppressPackageStartupMessages(library(tidyverse))

# Set the path to the CSV file
file_path <- "Human Development Index - Full.csv"

# Read the CSV file into a dataframe
hdi_data <- read.csv(file_path, header = TRUE, stringsAsFactors = FALSE)

# Examine the structure of the dataframe
str(hdi_data)

Explanation:

We start by loading the tidyverse library, which is a collection of R packages designed for data science.

We then set the path to the CSV file.

The read.csv() function is used to load the data into an R dataframe named hdi_data. We set stringsAsFactors = FALSE to ensure that string data is read in as character types instead of factors, which is usually more convenient for data cleaning.

The str() function provides a detailed structure of the dataframe, including the type of each column and the first few entries.
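The effect of stringsAsFactors can be seen on a tiny in-memory CSV. This is a minimal sketch: the column names and values in csv_text are made up for illustration and are not taken from the HDI file.

```r
# Toy CSV held in a string (hypothetical columns, not the real HDI data)
csv_text <- "country,score\nA,0.9\nB,0.4"

# Same data read twice, differing only in stringsAsFactors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$country)  # character
class(as_fct$country)  # factor

# str() lists each column with its type and first few values
str(as_chr)
```

Keeping text as character makes string cleaning (trimming, recoding, pattern matching) simpler; the columns can be turned into factors later, once they are clean.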

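# Create an interactive table (install DT first if it is missing)
if (!requireNamespace("DT", quietly = TRUE)) {
  install.packages("DT")
}
library(DT)
datatable(hdi_data)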

Explanation:

We first check if the DT package is installed and install it if it’s not already available.

We load the DT library to use the datatable() function.

The datatable() function creates an interactive table that allows for searching, filtering, and sorting within the R environment, typically within an R Markdown document or a Shiny app.
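For wide datasets it may help to limit the page length and add per-column filter boxes. The sketch below uses a small made-up data frame (toy_table is a hypothetical name) rather than hdi_data, and assumes the DT package is installed.

```r
library(DT)

# Small stand-in data frame (made-up values) so the example is self-contained
toy_table <- data.frame(
  country = c("A", "B", "C"),
  score = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# filter = "top" adds a filter box above each column;
# pageLength sets how many rows are shown per page
tbl <- datatable(toy_table, filter = "top", options = list(pageLength = 10))
tbl
```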

Part 2: Handling Missing Values

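In this step, we’ll assess the missing values in our dataset and deal with them in two stages: first drop columns in which too large a share of values is missing, then impute the remaining gaps with the column mean (numeric columns) or the column mode (categorical columns).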

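# Install (if needed) and load the naniar package
if (!requireNamespace("naniar", quietly = TRUE)) {
  install.packages("naniar")
}
library(naniar)

# Visualize the pattern of missing values before cleaning
vis_miss(hdi_data)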

# Define a threshold for maximum allowable percentage of missing data per column
threshold <- 0.5 # 50% threshold

# Remove columns with missing data above the threshold
hdi_data <- hdi_data %>%
  select_if(~ mean(is.na(.)) < threshold)

# Function to calculate the mode for categorical data
getmode <- function(v) {
  uniqv <- unique(na.omit(v))
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Impute missing values for numerical and categorical data
hdi_data <- hdi_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  mutate(across(where(is.character), ~ ifelse(is.na(.), getmode(.), .)))

# Check the structure after handling missing values
str(hdi_data)

Explanation:

We use naniar for visualizing missing data.

We define a variable threshold that specifies the maximum proportion of missing values allowed in a column for it to be retained.

We filter out columns where the proportion of missing values exceeds the threshold using select_if.

We create a function getmode to compute the mode of a vector, which is used for imputing missing categorical data.

We impute missing values in numerical columns with the column mean and in categorical columns with the column mode using mutate and across.

Finally, we check the structure of our data with str to ensure that all columns have the appropriate data types and that missing values have been handled.
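The threshold filter and the two imputation rules can be traced on a toy data frame. This is a sketch in base R under stated assumptions: the data frame toy and its columns are made up, and the base-R indexing stands in for the dplyr pipeline used above.

```r
# Toy data (made-up): one column is 75% missing, the other two have one gap each
toy <- data.frame(
  mostly_missing = c(NA, NA, NA, 4),
  score = c(1, NA, 3, 4),
  group = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Keep only columns whose share of missing values is below the threshold
threshold <- 0.5
keep <- colMeans(is.na(toy)) < threshold
toy <- toy[, keep, drop = FALSE]

# Mode helper, same logic as getmode above
getmode <- function(v) {
  uniqv <- unique(na.omit(v))
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Numeric gap -> column mean; character gap -> column mode
toy$score[is.na(toy$score)] <- mean(toy$score, na.rm = TRUE)
toy$group[is.na(toy$group)] <- getmode(toy$group)

toy
```

Here mostly_missing is dropped, the missing score becomes the mean of 1, 3, and 4, and the missing group becomes "a". Note that mean imputation shrinks the variance of a column, so it is worth recording which values were imputed if later analysis depends on spread.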

Part 3: Data Type Conversion

This phase focuses on the crucial aspect of converting data to the correct type to ensure accurate analysis.

# Ensure correct data types for each column

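# For numerical columns - convert character columns to numeric, but only when
# every non-missing value parses as a number (otherwise text such as country
# names would silently become NA)
is_numeric_string <- function(x) {
  is.character(x) && any(!is.na(x)) &&
    !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}
hdi_data <- hdi_data %>%
  mutate(across(where(is_numeric_string), ~ as.numeric(as.character(.)), .names = "numeric_{.col}"))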

# For categorical columns - convert characters to factors if necessary
hdi_data <- hdi_data %>%
  mutate(across(where(is.character), as.factor, .names = "factor_{.col}"))

# For date columns - convert characters to Date objects if necessary
# hdi_data <- hdi_data %>%
#   mutate(Date_Column = as.Date(Date_Column, format="%Y-%m-%d"))

# Check the updated structure after conversions
str(hdi_data)

Explanation:

We use mutate() and across() functions from the dplyr package to convert the data types.

For numerical columns, if they are mistakenly read as factors or characters, we convert them to numeric using as.numeric() after ensuring they are characters with as.character(). This step is crucial to prevent the unintended conversion of factor levels to their underlying integer codes.

For categorical data, we transform character columns to factors with as.factor(), which facilitates various analyses, especially in statistical modeling.

Date columns, if present, are converted from characters to Date objects with as.Date(), applying the appropriate date format.

Post-conversion, we verify the dataframe’s structure with str() to confirm the data types are correctly adjusted.

It’s noteworthy that converting characters to numeric will yield NA for non-numeric strings. It is vital to convert only those columns that are assuredly numeric values in string format.
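Both pitfalls described above are easy to reproduce on small vectors. The values below are made up for illustration.

```r
# A numeric column that was mistakenly read as a factor
f <- factor(c("10", "20", "10"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)                 # 1 2 1

# Going through character first recovers the real numbers
values <- as.numeric(as.character(f))  # 10 20 10

# Non-numeric strings become NA (R also emits a coercion warning)
parsed <- suppressWarnings(as.numeric(c("3.5", "n/a")))  # 3.5 NA

codes
values
parsed
```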

Visualization

Next, load the further-cleaned file, in which every column name is lowercase and contains no spaces.

# Set the path to the CSV file
file_path <- "cleaned_HDI_data.csv"

# Read the CSV file into a dataframe
cleaned_hdi_data <- read.csv(file_path, header = TRUE, stringsAsFactors = FALSE)
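The document does not show how the column names were cleaned. One possible way to get lowercase names without spaces or punctuation is sketched below; clean_names is a hypothetical helper, and the raw header strings are assumed examples rather than confirmed names from the original file.

```r
# Hypothetical helper: drop everything except letters and digits, then lowercase
clean_names <- function(x) {
  tolower(gsub("[^A-Za-z0-9]", "", x))
}

# Assumed examples of raw headers
raw_names <- c("Country",
               "Life Expectancy at Birth (2021)",
               "Gross National Income Per Capita (2021)")

cleaned <- clean_names(raw_names)
cleaned
```

Applied to a data frame, this would be names(df) <- clean_names(names(df)), followed by write.csv(df, "cleaned_HDI_data.csv", row.names = FALSE) to save the result.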

  1. Interactive Scatter Plot of CO2 Emissions vs. Material Footprint

This plot will show the relationship between CO2 emissions per capita and the material footprint per capita for the latest available year (2021).

library(ggplot2)
library(plotly)

# Build the static scatter plot; the 'text' aesthetic feeds the hover tooltip
p <- ggplot(cleaned_hdi_data, aes(x = carbondioxideemissionspercapitaproductiontonnes2021,
                                  y = materialfootprintpercapitatonnes2021)) +
  geom_point(aes(text = paste("Country:", country)), size = 2) +
  theme_minimal() +
  labs(title = "CO2 Emissions vs. Material Footprint (2021)",
       x = "CO2 Emissions Per Capita (tonnes)",
       y = "Material Footprint Per Capita (tonnes)")

# Convert to interactive plot
p_interactive <- ggplotly(p, tooltip = "text")

p_interactive

  2. Interactive 3D Scatter Plot

This 3D scatter plot shows expected years of schooling (male), life expectancy at birth, and GNI per capita for 2021.

library(plotly)

# 3D scatter plot of schooling, life expectancy, and GNI per capita for 2021
p <- plot_ly(cleaned_hdi_data,
             x = ~expectedyearsofschoolingmale2021,
             y = ~lifeexpectancyatbirth2021,
             z = ~grossnationalincomepercapita2021,
             type = 'scatter3d',
             mode = 'markers',
             marker = list(size = 3),
             text = ~country)  # Country name shown on hover

p

  3. Interactive Bar Chart of GNI Per Capita by Country for 2021

This section will focus on creating an interactive bar chart that represents the Gross National Income (GNI) per capita for different countries in the year 2021.

library(plotly)

# Create a bar chart for GNI per capita for the year 2021
gni_bar_chart <- plot_ly(cleaned_hdi_data, x = ~country, y = ~grossnationalincomepercapita2021, type = 'bar', text = ~country)

gni_bar_chart

Description:

This plot is an interactive bar chart showing the Gross National Income (GNI) per capita for different countries in the year 2021. Each bar represents a country, and the height indicates the GNI per capita value.

  4. Interactive Line Plot of Life Expectancy Over Time by Country

This visualization will present an interactive line plot, illustrating the trends in life expectancy across various countries over time.

library(plotly)
library(dplyr)
library(tidyr)

# Pivot the life expectancy data to long format, keeping only columns named
# "lifeexpectancyatbirth" followed by a four-digit year
life_expectancy_data <- cleaned_hdi_data %>%
  select(country, matches("^lifeexpectancyatbirth[0-9]{4}$")) %>%
  pivot_longer(cols = -country, names_to = "year", names_prefix = "lifeexpectancyatbirth", values_to = "life_expectancy") %>%
  mutate(year = as.numeric(year))

# Create the interactive line plot
life_expectancy_plot <- plot_ly(life_expectancy_data, x = ~year, y = ~life_expectancy, color = ~country, type = 'scatter', mode = 'lines')

life_expectancy_plot

Description:

This visualization shows the trend of life expectancy at birth for all countries over the available years. Each line represents a different country, and the user can hover over the plot to see the exact values.
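The wide-to-long reshaping behind this plot can be illustrated on a two-country toy table. The values are made up, and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Toy wide table: one column per year (made-up values)
wide <- data.frame(
  country = c("A", "B"),
  lifeexpectancyatbirth2020 = c(70.1, 80.2),
  lifeexpectancyatbirth2021 = c(70.5, 80.0),
  stringsAsFactors = FALSE
)

# One row per country-year; names_prefix strips the shared prefix so that
# only the year remains in the 'year' column
long <- pivot_longer(wide,
                     cols = -country,
                     names_to = "year",
                     names_prefix = "lifeexpectancyatbirth",
                     values_to = "life_expectancy")
long$year <- as.numeric(long$year)

long
```

The long table has one row for each country and year, which is the shape plot_ly needs to draw one line per country.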

  5. Interactive 3D Scatter Plot of HDI Components

This interactive visualization will plot the components of the Human Development Index (HDI) in a three-dimensional space, allowing for an immersive exploration of the data.

library(plotly)

# Create a 3D scatter plot with HDI components: Life Expectancy, Education, and GNI
hdi_3d_scatter <- plot_ly(cleaned_hdi_data, 
                          x = ~lifeexpectancyatbirth2021, 
                          y = ~meanyearsofschooling2021, 
                          z = ~grossnationalincomepercapita2021, 
                          color = ~humandevelopmentindex2021, 
                          text = ~country,
                          type = 'scatter3d', 
                          mode = 'markers')

hdi_3d_scatter

Description:

This 3D scatter plot visualizes the three dimensions of the Human Development Index (HDI): life expectancy at birth, mean years of schooling (one of the two education indicators), and GNI per capita. Marker color encodes the 2021 HDI value, so countries at similar development levels appear in similar shades.

  6. Interactive Choropleth Map of HDI by Country

The forthcoming interactive map will illustrate the Human Development Index (HDI) across countries, providing a geographical perspective on global human development levels.

library(plotly)

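# Set the trace type explicitly so the data is drawn as a choropleth
hdi_map <- plot_geo(cleaned_hdi_data) %>%
  add_trace(type = 'choropleth', locationmode = 'ISO-3', locations = ~iso3, z = ~humandevelopmentindex2021, text = ~country, colors = "Blues")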

hdi_map

Description:

This visualization is an interactive choropleth map showing the Human Development Index by country for the year 2021. Countries are shaded based on their HDI value, providing a quick visual indication of human development across the globe. The map’s interactivity allows for engagement with the data, offering insights into the HDI of each country upon interaction.

Trend Analysis with Linear Models:

How has the Human Development Index (HDI) of a specific country (or a set of countries) changed over the years? This question can be addressed using linear regression to model the trend of HDI over time.
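Before building the interactive app, the underlying idea can be sketched with lm() on a single made-up series. The data frame trend and its values are hypothetical and chosen to be perfectly linear so that the fitted slope is easy to check.

```r
# Made-up HDI series for one country: rises by 0.005 per year from 0.70
trend <- data.frame(year = 2000:2009)
trend$hdi <- 0.70 + 0.005 * (trend$year - 2000)

# Fit a straight line of HDI against year
fit <- lm(hdi ~ year, data = trend)

# The 'year' coefficient is the estimated average change in HDI per year
slope <- unname(coef(fit)["year"])
slope

# Extrapolate the fitted trend to a later year
pred_2015 <- unname(predict(fit, newdata = data.frame(year = 2015)))
pred_2015
```

With real data the points will not lie exactly on a line, and summary(fit) reports the standard error and R-squared needed to judge how well a linear trend describes the series.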

library(shiny)
library(ggplot2)
library(dplyr)
library(tidyr)
hdi_data <- read.csv("cleaned_HDI_data.csv")
ui <- fluidPage(
    titlePanel("HDI Trend Analysis"),
    sidebarLayout(
        sidebarPanel(
            selectInput("countryInput", 
                        "Select Country:", 
                        choices = unique(hdi_data$country),
                        selected = "United States",
                        multiple = TRUE),
            sliderInput("yearRange", 
                        "Select Year Range:",
                        min = 1990, 
                        max = 2021, 
                        value = c(1990, 2021),
                        step = 1)
        ),
        mainPanel(
            plotOutput("hdiTrendPlot")
        )
    )
)

server <- function(input, output) {
    output$hdiTrendPlot <- renderPlot({
        filtered_data <- hdi_data %>%
            filter(country %in% input$countryInput) %>%
            select(all_of(c("country", paste0("humandevelopmentindex", 1990:2021)))) %>%
            pivot_longer(cols = starts_with("humandevelopmentindex"), 
                         names_to = "year", 
                         values_to = "hdi") %>%
            mutate(year = as.numeric(sub("humandevelopmentindex", "", year))) %>%
            filter(year >= input$yearRange[1] & year <= input$yearRange[2])

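        ggplot(filtered_data, aes(x = year, y = hdi, group = country, color = country)) +
            geom_line() +
            geom_point() +
            # Overlay a fitted linear trend (least-squares line) for each country
            geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dashed") +
            theme_minimal() +
            labs(title = "HDI Trend Over Time",
                 x = "Year",
                 y = "Human Development Index (HDI)") +
            theme(legend.title = element_blank())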
    })
}
shinyApp(ui, server)

Citation:

This analysis was conducted with reference to the Shiny article available at https://shiny.posit.co/r/articles/share/shinyapps/, accessed on 21 November 2023.